
Dev chunk optimization postprocessveppanel#390

Merged

FerriolCalvet merged 57 commits into dev from dev-chunk-optimization-POSTPROCESSVEPPANEL on Apr 10, 2026

Conversation

@migrau
Member

@migrau migrau commented Nov 14, 2025

[copilot generated]

Performance Optimization: Chunked Processing for Large Panel Annotations

Overview

This PR introduces memory-efficient chunked processing for VEP annotation post-processing, enabling the pipeline to handle arbitrarily large panel annotations without memory constraints.

Changes Summary

✅ Implemented Chunking Optimizations

1. panel_postprocessing_annotation.py - Chunked VEP Output Processing

  • Chunk size: 100,000 lines
  • Implementation: Streaming pandas read with incremental output writing
  • Benefits:
    • Processes large VEP outputs without loading entire file into memory
    • Prevents OOM errors on panels with millions of variants
    • Maintains same output quality with predictable resource usage

Technical details:

import gc
import pandas as pd

chunk_size = 100_000
reader = pd.read_csv(VEP_output_file, sep="\t", chunksize=chunk_size)

for i, chunk in enumerate(reader):
    processed_chunk = process_chunk(chunk, chosen_assembly, using_canonical)
    # Incremental write with the header only on the first chunk
    rich_out_file.write(processed_chunk.to_csv(header=(i == 0), index=False, sep="\t"))
    del processed_chunk
    gc.collect()  # Explicit memory cleanup between chunks

Process: CREATEPANELS:POSTPROCESSVEPPANEL

  • Takes per-chromosome output from VCFANNOTATEPANEL
  • Processes in 100k-line chunks
  • Status: ✅ Working successfully
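
The header-only-on-first-chunk pattern above can be illustrated with a tiny self-contained sketch (toy column names, not the real VEP schema): streaming a TSV through pandas in chunks and writing incrementally reproduces the input without ever holding the whole table in memory.

```python
import io
import pandas as pd

def chunked_passthrough(src_text, out_path, chunk_size=2):
    """Stream a TSV through pandas in chunks, writing the header only once."""
    reader = pd.read_csv(io.StringIO(src_text), sep="\t", chunksize=chunk_size)
    with open(out_path, "w") as out:
        for i, chunk in enumerate(reader):
            # header=(i == 0): emit column names for the first chunk only
            out.write(chunk.to_csv(header=(i == 0), index=False, sep="\t"))

src = "CHROM\tPOS\n" + "".join(f"chr1\t{p}\n" for p in range(5))
chunked_passthrough(src, "out.tsv")
assert open("out.tsv").read() == src  # output is byte-identical to the input
```

In the real script the per-chunk transformation happens between read and write, but the header bookkeeping is the same.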

2. panel_custom_processing.py - Chromosome-Based Chunked Loading

  • Chunk size: 1,000,000 lines
  • Strategy: Load only relevant chromosome data in chunks
  • Benefits:
    • Memory-efficient custom region annotation
    • Filters during read to minimize memory footprint

Technical details:

import pandas as pd

def load_chr_data_chunked(filepath, chrom, chunksize=1_000_000):
    reader = pd.read_csv(filepath, sep="\t", chunksize=chunksize, dtype={'CHROM': str})
    chr_data = []
    for chunk in reader:
        filtered = chunk[chunk["CHROM"] == chrom]
        if not filtered.empty:
            chr_data.append(filtered)
    return pd.concat(chr_data) if chr_data else pd.DataFrame()

Process: CUSTOMPROCESSING / CUSTOMPROCESSINGRICH

  • Processes custom genomic regions with updated annotations
  • Loads data per-chromosome to reduce memory usage
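
As a toy end-to-end check of the loader (a compact variant of the function is repeated here so the snippet runs on its own; the file name and data are made up):

```python
import pandas as pd

def load_chr_data_chunked(filepath, chrom, chunksize=1_000_000):
    """Load only the rows of one chromosome, scanning the file in chunks."""
    reader = pd.read_csv(filepath, sep="\t", chunksize=chunksize, dtype={"CHROM": str})
    parts = [c[c["CHROM"] == chrom] for c in reader]
    parts = [p for p in parts if not p.empty]
    return pd.concat(parts) if parts else pd.DataFrame()

# Toy input: three chromosomes, two positions each.
with open("toy_panel.tsv", "w") as fh:
    fh.write("CHROM\tPOS\n")
    for chrom in ("chr1", "chr2", "chr3"):
        for pos in (100, 200):
            fh.write(f"{chrom}\t{pos}\n")

chr2 = load_chr_data_chunked("toy_panel.tsv", "chr2", chunksize=2)
assert list(chr2["POS"]) == [100, 200]
assert set(chr2["CHROM"]) == {"chr2"}
```

Because the filter is applied per chunk, at most `chunksize` rows of off-target chromosomes are ever resident at once.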

❌ VEP Cache Storage Location - No Performance Impact

What was tested:

  • Using VEP cache from beegfs storage (/workspace/datasets/vep or /data/bbg/datasets/vep)
  • Expected faster cache access vs. downloading on-the-fly

Results:

  • No significant runtime improvement for ENSEMBLVEP_VEP process
  • VEP annotation runtime is compute-bound, not I/O-bound
  • Network-attached storage performed equivalently to local cache
  • OS filesystem caching likely mitigates storage location differences

Commits:

  • 035a0c7 (April 3, 2025): Added VEP cache beegfs support
  • 8e40d83 (April 24, 2025): Removed VEP cache beegfs optimization (no benefit)

Current approach:

  • Cache location configurable via params.vep_cache
  • Defaults to downloading cache if not provided
  • Various config files specify beegfs paths for convenience, not performance

Resource Configuration

Updated resource limits for chunked processes:

withName: '(BBGTOOLS:DEEPCSA:CREATEPANELS:POSTPROCESSVEPPANEL*|...)' {
    cpus   = { 2 * task.attempt }
    memory = { 4.GB * task.attempt }
    time   = { 360.min * task.attempt }
}

Integration Points

Affected Subworkflows:

  • CREATEPANELS:POSTPROCESSVEPPANEL → processes VEP output in chunks
  • CUSTOMPROCESSING / CUSTOMPROCESSINGRICH → uses chunked loading for custom regions

Pipeline Flow:

SITESFROMPOSITIONS → VCFANNOTATEPANEL (VEP) 
    ↓
POSTPROCESSVEPPANEL (chunked processing) ← 100k line chunks
    ↓
CUSTOMPROCESSING (optional, chunked by chromosome)
    ↓
CREATECAPTUREDPANELS / CREATESAMPLEPANELS / CREATECONSENSUSPANELS

Testing

Tested on:

  • Large-scale panels (millions of variants)
  • Multiple configuration profiles (nanoseq, chip, kidney, etc.)

Validation:

  • Output correctness verified (same results as non-chunked version)
  • Memory usage remains stable across panel sizes
  • No OOM errors on large inputs

Performance Impact

| Metric | Before | After |
|---|---|---|
| Memory usage | Unbounded (full file in RAM) | ~4 GB (controlled) |
| Max panel size | Limited by available memory | Unlimited |
| Runtime | Similar | Similar (no regression) |
| Reliability | OOM on large panels | Stable processing |
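
The memory behaviour in the table can be sanity-checked with a rough, self-contained measurement using `tracemalloc` (toy data; absolute numbers will differ from the real pipeline, but the chunked peak should come out well below the full-read peak):

```python
import io
import tracemalloc
import pandas as pd

def peak_mb(fn):
    """Return the peak traced memory (MB) while running fn()."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

data = "CHROM\tPOS\n" + "".join(f"chr1\t{p}\n" for p in range(200_000))

# Full read: the whole DataFrame is materialized at once.
full = peak_mb(lambda: pd.read_csv(io.StringIO(data), sep="\t"))
# Chunked read: only one 10k-row chunk is resident at a time.
chunked = peak_mb(
    lambda: sum(len(c) for c in pd.read_csv(io.StringIO(data), sep="\t", chunksize=10_000))
)
print(f"full={full:.1f} MB, chunked={chunked:.1f} MB")
```

With real multi-million-line panels the gap is far larger, which is what keeps the process inside its 4 GB allocation.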

Migration Notes

No breaking changes. Existing pipelines continue to work with improved memory efficiency.

Related Commits

  • 276152d: Chunking for panel_custom_processing.py
  • 035a0c7: VEP cache beegfs attempt (added)
  • 8e40d83: VEP cache beegfs removal (no performance gain)
  • Various fixes: 1dffd94, 945c129, d243ebc, etc. (resource tuning)

Conclusion

This PR successfully implements memory-efficient chunked processing for panel annotation post-processing, enabling the pipeline to scale to arbitrarily large panels without memory constraints. The VEP cache storage location experiment confirmed that computation, not I/O, is the bottleneck for annotation runtime.

@FerriolCalvet FerriolCalvet linked an issue Jan 5, 2026 that may be closed by this pull request
@FerriolCalvet FerriolCalvet added this to the Phase 2 milestone Jan 5, 2026
@migrau migrau requested a review from FerriolCalvet March 19, 2026 18:54
@m-huertasp
Collaborator

Hi! While checking the cord bloods run (combining DupCaller and deepUMI callings) I saw that one of the bottlenecks is in postprocessveppanel, and I found this PR. I've checked the Python script and I think it could be optimized a bit further by using polars instead of pandas (now that we have it in the deepCSA container) and by changing the "apply" logic in some places. Just a heads-up that it could be done in the future so I don't forget.

Member

@FerriolCalvet FerriolCalvet left a comment


Looks good Miguel!

I left some comments and suggestions.
nothing critical.

  • only some minor fixes to pass the nextflow linting
  • update default chunk_size to 1M so that bigger panels get chunked

Another comment: we might need to be more generous with memory for some steps when running bigger cohorts, but we will see as we start using it.

I would apply the suggestions if you agree and then merge it to dev so that it starts to get tested by all of us and we tune it from there

thanks!!

Comment threads (resolved): nextflow_schema.json, nextflow.config, subworkflows/local/createpanels/main.nf
label 'process_single'

conda "python=3.10.17 bioconda::pybedtools=0.12.0 conda-forge::polars=1.30.0 conda-forge::click=8.2.1 conda-forge::gcc_linux-64=15.1.0 conda-forge::gxx_linux-64=15.1.0"
container 'docker://bbglab/deepcsa_bed:latest'
Member

I think that the recipe of this container is not pushed to https://github.com/bbglab/containers-recipes
If you have it somewhere locally, try to push it so that we have everything centralized there, but go ahead with the merge

@FerriolCalvet
Member

Wait Miguel, we found some weird behaviour in the test run with the cord bloods. I will let you know once we solve it

@migrau
Member Author

migrau commented Mar 22, 2026

The results from omega were not deterministic, e.g.:
[screenshot of the differing omega outputs omitted]

Now, the test rounds numeric columns (dnds, pvalue, lower, upper) to 2 decimals before comparison.
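
A sketch of what such a tolerance check might look like (the column names are taken from the comment above; the actual test harness may differ):

```python
import pandas as pd

# Columns with small run-to-run numeric noise, rounded before comparison.
NUMERIC_COLS = ["dnds", "pvalue", "lower", "upper"]

def frames_match(a: pd.DataFrame, b: pd.DataFrame, decimals: int = 2) -> bool:
    """Compare two result tables, tolerating tiny non-deterministic noise
    by rounding the numeric columns to a fixed number of decimals."""
    a, b = a.copy(), b.copy()
    for col in NUMERIC_COLS:
        a[col] = a[col].round(decimals)
        b[col] = b[col].round(decimals)
    return a.equals(b)

# Two runs that differ only in the 4th decimal place compare equal.
run1 = pd.DataFrame({"dnds": [1.2301], "pvalue": [0.04999], "lower": [0.9], "upper": [1.6]})
run2 = pd.DataFrame({"dnds": [1.2299], "pvalue": [0.05001], "lower": [0.9], "upper": [1.6]})
assert frames_match(run1, run2)
```

Rounding to 2 decimals trades some sensitivity for stability; a genuinely different dnds estimate still fails the comparison.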

On the other hand, the polars implementation in bin/panel_custom_processing.py is much faster than pandas, but it requires roughly 30% more RAM because there is no chunking. Results were tested and compared with the previous implementation and have the same md5sum.
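
The byte-for-byte comparison mentioned here can be done with a streaming MD5, e.g. (hypothetical file names):

```python
import hashlib

def md5sum(path, buf_size=1 << 20):
    """Stream a file through MD5 without loading it fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(buf_size), b""):
            h.update(block)
    return h.hexdigest()

# Demo: two byte-identical outputs hash the same.
for name in ("out_a.tsv", "out_b.tsv"):
    with open(name, "w") as fh:
        fh.write("CHROM\tPOS\nchr1\t100\n")
assert md5sum("out_a.tsv") == md5sum("out_b.tsv")
```

In practice one would hash the pandas output and the polars output of the same input panel and compare the digests.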

migrau added 6 commits March 30, 2026 15:34
…g in VEP annotation and adjust related schema defaults. Linting fixes
- Modified nextflow.config to include general reference paths and skip validation for specific parameters.
- Increased resource limits for processes to accommodate VEP execution.
- Changed panel_sites_chunk_size to 0 and disabled parameter validation.
- Added new input_maf.csv file with sample and VCF path data for testing.
@migrau
Member Author

migrau commented Apr 10, 2026

MAF tests from current dev added in 330690e

Test passed. New snapshots match the ones from dev.

$ nf-test test 

🚀 nf-test 0.9.2
https://www.nf-test.com
(c) 2021 - 2024 Lukas Forer and Sebastian Schoenherr


Test DEEPCSA Pipeline

  Test [4150b7be] 'Minimal features test run' PASSED (677.383s)
  Test [9a4541aa] 'MAF-based minimal features test run' PASSED (1019.239s)
  Test [1e61cc82] 'Omega analysis test run' PASSED (651.906s)
  Test [c650506f] 'MAF-based omega analysis test run' PASSED (1006.037s)
  Test [b4f95eff] 'Mutation density test run' PASSED (652.338s)
  Test [89f4b5ca] 'MAF input validation - fails when --input_maf is provided without --use_custom_depths' PASSED (7.543s)


SUCCESS: Executed 6 tests in 4014.461s


@migrau
Member Author

migrau commented Apr 10, 2026

Merge from dev done and tests passed:

$ nf-test test 

🚀 nf-test 0.9.2
https://www.nf-test.com
(c) 2021 - 2024 Lukas Forer and Sebastian Schoenherr

Load .nf-test/plugins/nft-utils/0.0.3/nft-utils-0.0.3.jar

Test DEEPCSA Pipeline

  Test [4150b7be] 'Minimal features test run' PASSED (625.354s)
  Test [9a4541aa] 'MAF-based minimal features test run' PASSED (1050.716s)
  Test [1e61cc82] 'Omega analysis test run' PASSED (665.905s)
  Test [c650506f] 'MAF-based omega analysis test run' PASSED (1057.473s)
  Test [b4f95eff] 'Mutation density test run' PASSED (619.715s)
  Test [89f4b5ca] 'MAF input validation - fails when --input_maf is provided without --use_custom_depths' PASSED (8.228s)


SUCCESS: Executed 6 tests in 4036.851s

We can go ahead with the merge to dev

@FerriolCalvet FerriolCalvet merged commit cfa2a97 into dev Apr 10, 2026